


NASA Technical Memorandum 84501 


(NASA-TM-84501) ADVANCED RELIABILITY MODELING OF FAULT-TOLERANT
COMPUTER-BASED SYSTEMS (NASA) 33 p HC A03/MF A01 CSCL 09B

N82-30962
Unclas
G3/61 28695


Advanced Reliability Modeling of 
Fault-Tolerant Computer-Based Systems 


Salvatore J. Bavuso 


MAY 1982 


NASA 

National Aeronautics and 
Space Administration 

Langley Research Center 

Hampton, Virginia 23665



ADVANCED RELIABILITY MODELING OF FAULT-TOLERANT 

COMPUTER-BASED SYSTEMS 

Salvatore J. Bavuso

NASA Langley Research Center

Hampton, Virginia 23665/USA


SUMMARY 


Digital fault-tolerant computer-based systems are on the verge of
becoming commonplace in military and commercial avionics. These systems
hold the promise of increased availability, reliability, and maintainability
over conventional analog-based systems through the application of replicated
digital computers arranged in fault-tolerant configurations. Three tightly
coupled factors of paramount importance which will ultimately determine the
viability of these systems are reliability, safety, and profitability.
Reliability, the major driver, involves virtually every aspect of design,
packaging, and field operations as regards safety, maintainability, and
invariably profit for commercial applications or national security for
military uses.

The antithesis of promise for the digital computer is, however, the
Achilles' heel of the reliability engineer. The utilization of digital 
computer systems makes the task of producing a credible reliability 
assessment a formidable one. The root of the problem is embodied in the 
very essence that makes the digital computer such an outstanding device for 
use in a host of applications, namely its adaptability to changing 
requirements, computational power, and ability to test itself efficiently. 

It is the intent of this presentation to address the nuances of modeling the 
reliability of systems with large state sizes, in the "Markov" sense, which 
result from systems that are based on replicated redundant hardware and to
discuss the modeling of numerous factors which can reduce reliability 
without concomitant depletion of spare hardware. The diminishing factors 
are captured by the popular "coverage" terminology. Advanced coverage 
(fault-handling) models are described with supporting rationale. Methods of 
acquiring and measuring parameters for these models are delineated, and some 
recently measured latent-fault data are presented. 


INTRODUCTION 


It is the intent of this paper to report on the development of two 
novel methodologies for the reliability assessment of fault-tolerant digital 
computer-based systems: Computer-Aided Reliability Estimation III and Gate 
Logic Software Simulation. Both technologies were developed to mitigate a 
serious weakness in the design and evaluation process of ultrareliable
digital systems. The weak link is based on the unavailability of a
sufficiently powerful modeling technique for comparing the stochastic
attributes of one system against others. Some of the more interesting
attributes are reliability, system survival, safety, and mission success.

A long-term goal of the NASA Langley Research Center is the
development of this tool. The technology development process is shown in
figure 1. Historically, our interest in this subject commenced circa 1971.
At that time, two math models were identified as having potential for
filling the assessment gap. Figure 1 shows those models as CARE, Computer-
Aided Reliability Estimation, a computer program generated at NASA's Jet
Propulsion Laboratory for application to long-lived, space-borne computer
systems; and TASRA, Tabular System Reliability Analysis, a computer program
due to Battelle Memorial Laboratories for application to the F-111 Pitch
Flight Control System (refs. 1 and 2).

The CARE computer program is a very powerful reliability assessment
capability for fault-tolerant system concepts that existed in the late
1960's. A major innovation in reliability modeling in CARE was the
incorporation of the stochastic concept of coverage due to Roth and
Bouricius, et al. (ref. 3). Coverage, defined as the conditional
probability that a proper recovery occurs if a fault exists, was shown by
Bouricius and Carter, et al., to be a significant factor for achieving high
reliability in modular replacement systems (ref. 4). Prior to this
consideration, reliability analyses omitted the coverage parameter entirely,
which caused math models to assume a unity probability of system recovery
given a fault occurrence, thereby forcing the reliability predictions to be
nonconservative and, hence, inaccurate. Although powerful and innovative,
the CARE math model suffers from two major deficiencies: its inflexibility
to model the emerging multiprocessor-based systems and the lack of a model
for computing the coverage parameter.
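
The effect of the coverage parameter can be illustrated with a small worked
case. The sketch below is not taken from CARE; it assumes a hypothetical
two-unit system that survives on one unit, constant and identical failure
rates, and a single coverage value applied to the first failure. Under those
assumptions the reliability is R(t) = exp(-2*rate*t) + 2*coverage*(exp(-rate*t)
- exp(-2*rate*t)), and setting coverage to 1 reproduces the nonconservative
prediction described above.

```python
from math import exp


def duplex_reliability(rate, t, coverage):
    """Reliability of a hypothetical two-unit system that survives on one unit.

    Each unit fails at the constant rate `rate`. The first failure is
    survived only with probability `coverage`; a second failure is fatal.
    """
    both_up = exp(-2.0 * rate * t)
    one_up = 2.0 * coverage * (exp(-rate * t) - exp(-2.0 * rate * t))
    return both_up + one_up


if __name__ == "__main__":
    rate, t = 1.0e-4, 10.0  # hypothetical failure rate (per hour) and mission time
    for c in (1.0, 0.99, 0.9):
        print(c, 1.0 - duplex_reliability(rate, t, c))
```

With these illustrative numbers the failure probability is dominated by the
(1 - coverage) term as soon as coverage drops below unity, which is the
behavior the coverage concept is meant to capture.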

The TASRA computer program, in contrast to CARE, utilizes the popular 
"Markov" analysis method which allows a very flexible modeling technique but 
lacks the vital coverage model as well. 

Based on these findings, NASA-Langley participated in the
codevelopment of CARE II with the Raytheon Company (refs. 5 and 6). The
primary objective in creating CARE II was to develop a coverage model to
compute coverage for CARE. Figure 2 presents the coverage math model and
delineates the factors comprising the coverage computation. Although the
CARE II coverage model represents a quantum leap in coverage modeling, CARE
II still retains the original CARE system's architectural-description
inflexibility.

A gestation period ensued following the CARE II development that
involved Langley in numerous studies, depicted in figure 1 as square blocks,
and the codevelopment of two new reliability assessment methodologies, i.e.,
CAST (Combined Analytic Simulative Technique) and CARSRA (Computer-Aided
Redundant System Reliability Analysis). The coverage impact study
determined upper and lower bound values of coverage for a fault-tolerant
triplex flight control computer system utilizing state-of-the-art hardware
(ref. 7). The CAST study made two important contributions to our program at
Langley: it emphasized the potential importance of transient modeling in
reliability predictions, and it introduced the notion of combining an
analytical approach with computer simulation (ref. 8). By partitioning the
modeling of ultrareliable systems in this latter manner, an otherwise
intractable problem using either technique in toto now becomes workable.
The CAST concept has become the mainstay of our approach to reliability
assessment since the completion of the CAST study, circa 1974.

CARSRA was a spin-off from a Boeing Company study on the design of an
Airborne Advanced Reconfigurable Computer System (ARCS) (ref. 9). The
development and application of CARSRA was our first comprehensive
involvement with the assessment of complicated aircraft flight control
systems. The complexity of flight control systems gave rise to the creation
of CARSRA in two important areas. Since CARSRA utilizes the Markov
approach, a state reduction technique was required, and it became clear that
the assessment technique must model stage dependencies in order to assess a
variety of system configurations which constitute continued mission
success. The most familiar example of this application is system survival.
In a redundant fault-tolerant system, there are many system configurations
which will effect the proper system output; however, there are other system
configurations that may be of interest in addition to system survival.
Boeing, in the ARCS study, defined the term "Functional Readiness." It is
expressed as a time-dependent probability and is applied to missions
containing critical subtasks which will either be performed or not
performed, depending on the operational redundancy level at the time of
demand. Boeing cites, as an example, an aircraft automatic landing function
for which a certain level of hardware redundancy is required before a
landing may be initiated in poor visibility and weather conditions. CARSRA
also benefited from its predecessors by incorporating a multi-coverage
parameter capability and an electrical transient modeling capability.

Transient modeling proceeded in two directions: the stochastic
estimation of intermittent failures of computer piece-parts and the modeling
of the effects of induced analog transients on digital circuitry (refs. 10
and 11). The latter study is ongoing work that relates an analog transient
source with a digital system's activity. The form of this relationship will
be a stochastic model for input to a system reliability assessment model.

Software reliability studies are an ongoing activity at Langley. The 
most recent completed study suggests the possibility of estimating software 
reliability through testing. Although still in the experimental stages, a 
methodology has been proposed and demonstrated that estimates probability of 
software error as a function of execution time and test trials (ref. 12). 

All of the activities depicted by figure 1 have culminated to form the 
basis for the development of CARE III. CARE III was codeveloped by Langley 
and the Raytheon Company and cosponsored by the U.S. Air Force Avionics 
Laboratory at the Wright-Patterson Air Force Base (refs. 13, 14, and 15). A 
summary of the salient features of CARE III is shown in figure 3. 



CARE III - A General-Purpose Reliability Analysis and Design Tool for
Fault-Tolerant Systems


CARE III was designed to model large ultrareliable replicated systems
incorporating digital electronics. Examples of such systems are shown in
figure 4 (refs. 16, 17, 18, and 19). The CARE III assessment process is
depicted in figure 5 and begins with an architectural description of an
ultrareliable system. That description may be based on a conceptual model
of the system, in which case CARE III is used as a design tool; or the
system may be well-defined so that CARE III is utilized as an analysis tool.
In either case, the analyst generates a set of failure rates and probability
density functions for the various failure and error mechanisms he wishes to
include in the analysis. A partial list of failure and error models is
delineated in figure 12. The inclusion of any of these models will
necessarily lower the system reliability estimate. The need to include a
model is of course a function of the architectural structure, its
fault-handling mechanisms, and the magnitudes of the parameters in the
model. The large choice of failure and error models is provided to increase
the realism and credibility of the analysis. The models are user options in
CARE III and may be omitted at the discretion of the analyst. Usually he
will omit certain models after determining that they have a minimal effect.
Some of this modeling information is used to define the system
fault-handling model(s), which is required as a user input, indicated by the
state diagram in figure 5. The remainder of the failure and error modeling
data is entered as Fortran NAMELIST statements; examples will follow.

Another important step in setting up CARE III input is the generation
of the system configuration and success criteria. Fault-tolerant systems
are usually designed to have many hardware and functional combinations that
enable proper system operation. CARE III uses the powerful fault tree
language to describe system failure configurations. In large systems, the
number of success combinations can be very large, and for this reason CARE
III uses the "unsuccess" or failure combinations instead. In a properly
designed system, the number of failure combinations should be considerably
less than the number of system success combinations, thus easing the
computational task. The fault tree language provides an excellent medium
for delineating the system failure combinations.
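
The imbalance between failure and success combinations can be checked with
simple counting. The sketch below is an illustration only, not part of CARE
III: it assumes a single stage of n distinguishable units, each either
operational or failed, and a stage that fails when fewer than m units are
operational. The function names are hypothetical.

```python
from math import comb


def failure_combinations(n, m):
    """Count up/down patterns of n units with fewer than m units operational."""
    return sum(comb(n, up) for up in range(m))


def success_combinations(n, m):
    """Count up/down patterns of n units with at least m units operational."""
    return 2 ** n - failure_combinations(n, m)


if __name__ == "__main__":
    # A stage of 10 units that needs at least 2 operational.
    print(failure_combinations(10, 2), success_combinations(10, 2))
```

For a stage of 10 units needing 2, this gives 11 failure patterns against
1013 success patterns, which is why enumerating the failure side is the
cheaper description.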

Referring again to figure 5, the user-prepared data are initially
processed by the CARE III input subprogram, CAREIN, shown as the upper
disk. It is essentially composed of the fault tree language program. The
second CARE III subprogram, COVRGE, processes the fault-handling model(s)
data and puts them into the form required by the third subprogram, CARE3,
which performs the reliability computations. CARE III is written entirely
in the Fortran IV language and currently executes on the Digital Equipment
Corporation PDP-10 computer and on the Control Data Corporation CYBER 170
series computers. The CARE III output data take two forms, graphical or
tabular. In either case, the outputs of most interest are the total system
reliability or system survival as a function of time and two vital
components: probability of system failure due to hardware redundancy
limitations (exhaustion of spares) and probability of system failure due to
improper fault handling. In ultrareliable systems, the latter factor is the
predominant cause of system failure (ref. 20).




An example of the CARE III assessment process is given by figures 6 and
7. Figure 6 is a sketch of an Ultrareliable Fault-Tolerant Multiprocessor
composed of 10 memory processor pairs which communicate with each other over
5 individual bus lines shown in the chart as a solid bus line. This system
survives if at least 2 computers and 2 buses are operational. The analyst
wants to compute the probability of system survival at 10 hours of
mission time for this multiprocessor system and the probability of system
failure at 10 hours due to spare hardware depletion and due to improper
system-fault handling. For this simple illustrative example, the analyst
creates two fault trees and a state diagram for system-fault handling as
depicted in figure 7. The System Fault Tree describes the system stage
configurations that cause system failure. The computer stage is comprised
of up to 10 computers, each having an identical failure rate. The bus stage
is composed of up to 5 buses, each having an identical failure rate, most
likely different than that of a computer. The OR gate in the System Fault
Tree means that the system fails if a computer stage fails or a bus stage
fails. A computer stage fails if less than 2 computers are operational, and
a bus stage fails if less than 2 buses are operational. These conditions
are described in the line beginning with "$STAGES". This statement is a
Fortran NAMELIST statement. It says there are two stages (NSTGES=2), i.e.,
10 computers and 5 buses (N=10,5), and the minimum for stage survival is 2
for each stage (M=2,2). The remainder of the line describes the form of
output data requested. The fault tree description for CARE III input is
shown under the heading, SYSTEM FAULT-TREE. It describes the gate
interconnections and the types of gates. There is another cause of system
failure that is implicit in the SYSTEM FAULT TREE and that is system failure
due to single-point failures in either stage. Details of this model are
discussed later.
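
The spares-exhaustion part of this example can be approximated by hand. The
sketch below is not CARE III and ignores fault handling entirely: it assumes
independent units with constant failure rates, no repair, and perfect
coverage, so only the "fewer than M of N operational" condition and the OR
gate are represented. The failure rates are hypothetical placeholders, since
the values used in figure 7 are not given in the text.

```python
from math import comb, exp


def stage_failure_prob(n, m, rate, t):
    """Probability that fewer than m of n identical units are up at time t."""
    p_up = exp(-rate * t)
    return sum(comb(n, k) * p_up ** k * (1.0 - p_up) ** (n - k) for k in range(m))


def system_failure_prob(stage_probs):
    """OR gate over independent stages: the system fails if any stage fails."""
    survive = 1.0
    for q in stage_probs:
        survive *= 1.0 - q
    return 1.0 - survive


if __name__ == "__main__":
    t = 10.0                       # mission time in hours (from the example)
    computer_rate = 1.0e-4         # hypothetical failures per hour
    bus_rate = 1.0e-5              # hypothetical failures per hour
    q_computers = stage_failure_prob(10, 2, computer_rate, t)  # N=10, M=2
    q_buses = stage_failure_prob(5, 2, bus_rate, t)            # N=5,  M=2
    print(system_failure_prob([q_computers, q_buses]))
```

The result corresponds only to the spares-exhaustion output; the failure
probability due to improper fault handling requires the fault-handling model
and is not captured here.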

A unique modeling capability of CARE III is the incorporation of the
effects of synergistic pairs of failures. In fault-tolerant systems, the
system could contain many undetected (latent) failures which individually
would not cause system failure; however, certain groupings of failures that
coexist may bring the system down. The CRITICAL-PAIR TREE enables the
analyst to specify the conditions under which synergistic paired failures
cause system failure. For this case, any two latent computer failures out
of ten computers or any two latent bus failures out of five buses cause
system failure. In practice, one would usually specify which paired
failures cause system failure. The CRITICAL-PAIR TREE is described by the
data listed under the heading, CRITICAL-FAULT PAIRS. The next step in the
CARE III input process is the description of the FAULT-HANDLING MODEL. This
simple state model is composed of two system states, active (A) and active
detected (AD). The active system state is entered when a failure occurs.
It is an undetected or latent state. If a fault detector is employed, δ is
the rate at which failures are purged from the system. CARE III assumes
that if the system enters the active detected state and it has spare
hardware, it will reconfigure out the faulty module and the system
recovers. Note, as δ increases, the probability of synergistic failures
occurring diminishes since there will be fewer latent failures present.
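
The dependence of critical-pair exposure on the detection rate can be seen
with a back-of-the-envelope race between two events. The sketch below is a
simplification and not the CARE III double-fault model: it assumes a constant
detection rate (written δ in the text), a constant per-unit failure rate, and
that any second failure in the same stage arriving before the first is
detected forms a critical pair. All numerical values are hypothetical.

```python
def prob_second_fault_before_detection(n_units, failure_rate, detection_rate):
    """Chance that another unit fails before a latent fault is detected.

    Two independent exponential clocks compete: the remaining (n_units - 1)
    units fail at a combined rate, and the detector fires at detection_rate.
    """
    competing = (n_units - 1) * failure_rate
    return competing / (competing + detection_rate)


if __name__ == "__main__":
    failure_rate = 1.0e-4  # hypothetical failures per hour per unit
    for detection_rate in (1.0, 10.0, 100.0, 1000.0):  # hypothetical, per hour
        print(detection_rate,
              prob_second_fault_before_detection(10, failure_rate, detection_rate))
```

Each tenfold increase in the detection rate lowers the exposure by roughly a
factor of ten in this regime, consistent with the qualitative statement that
faster detection leaves fewer latent failures to pair up.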

Line one, $FLTTYP, shows that there is one fault model (NFTYPS=1) and
defines the value of δ as 3.6 x 10* detections per hour. Line 3, $FLTCAT,
simply links failure rates denoted as RLM arrays to stages (JTYP). Line 5,
$RNTIME, specifies flight time of 10 hours. CARE III input data for this
example system is shown in figure 7 beginning with the statement $FLTTYP and
including all the statements that follow. The CARE III output is the total
system probability of failure, the system probability of failure due to
improper fault handling, and the probability of system failure due to spares
exhaustion.


User-Oriented Language for Describing Complex System Failure 
Configurations (Fault Tree) 


The multiprocessor example made use of a trivial application of the CARE
III fault tree language. A better example emphasizing the power of the
fault tree input is given by figures 8 and 9. Figure 8 shows a block
diagram of a proposed fault-tolerant flight-control system. Of particular
interest is the Pitch Augmentation Stability (PAS) short cycle function.
The system fault tree for this function is presented in figure 9. This tree
illustrates that not only hardware redundancy can be represented but
functional redundancy as well. The elevator math model is functionally
redundant to the secondary actuators. The melding of hardware and
functional redundancy is a common practice in aircraft design. The proper
entry of this fault tree into CARE III with the necessary failure rate and
fault-handling data would yield a prediction of the probability of loss of
PAS function as a function of mission time. For the uninitiated, figure 9
is read as follows: An output from logic OR gate 212 constitutes loss of
PAS function which can occur if an output from OR gate 211 occurs, or if an
output from gate 210 occurs, or both. Gate 210 yields an output if at least
3 out of 4 secondary actuators or actuator function (elevator math model)
fail. Secondary actuator A will fail if computer A fails, or actuator A
fails, or both. A similar description can be used to delineate failures due
to loss of computation or loss of sensors.
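
The way such a tree is read can be mimicked with a few lines of logic. The
sketch below is an illustration, not the CARE III fault tree language, and it
makes assumptions that the text does not confirm: the four inputs to gate 210
are taken to be three secondary actuator channels (labeled A, B, C) plus the
elevator math model, the subtree under gate 211 is reduced to a single
boolean, and all component names are hypothetical.

```python
def or_gate(*inputs):
    """Output is present if any input event is present."""
    return any(inputs)


def k_of_n_gate(k, inputs):
    """Output is present if at least k of the input events are present."""
    return sum(bool(x) for x in inputs) >= k


def loss_of_pas(failed, gate_211_output=False):
    """Evaluate the assumed tree for a set of failed component names."""
    channels = []
    for ch in "ABC":  # assumed: three secondary actuator channels
        # A secondary actuator fails if its computer or its actuator fails.
        channels.append(or_gate(f"computer_{ch}" in failed,
                                f"actuator_{ch}" in failed))
    # Assumed fourth input: the functionally redundant elevator math model.
    channels.append("elevator_math_model" in failed)
    gate_210 = k_of_n_gate(3, channels)
    return or_gate(gate_211_output, gate_210)  # gate 212


if __name__ == "__main__":
    print(loss_of_pas({"computer_A"}))
    print(loss_of_pas({"computer_A", "actuator_B", "elevator_math_model"}))
```

Evaluating the tree for a given set of failed components only answers whether
that configuration is a failure state; attaching failure rates and
fault-handling data is what turns it into a probability over mission time.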


Fault-Handling Model Based on Probabilistic Description of Operative 
Detection, Isolation, and Recovery Mechanisms 


In figure 7, a simple fault-handling model of two states was
described. CARE III has both a single-fault model and a double-fault
model. The latter defines critically coupled, paired failures. The
single-fault model is given in figure 10 and is shown in the dashed box. For
illustrative purposes, three additional states have been added so that the
state diagram is a Markov model of a 2-unit system. Initially, the system
is in state 0 and has experienced no failures. When a failure occurs, the
system enters state A, the Active latent state, given by the arrival
density, λ(t). Depending upon the nature of the failure, i.e.,
permanent, transient, intermittent, etc., the fault-handling model will be
defined differently. For example, if the failure were intermittent, λ(t)
would be the probability density function (PDF) for the arrival of an
intermittent, and states A and B define the intermittent model where α and β
are constant transition rates into and out of state B, respectively. When
the system is in state B, the Benign state, the failed unit appears to have
healed itself, i.e., the manifestation of the failure, a fault, vanishes;
however, when the failed manifestation is once again resumed (the fault
reappears), the system enters state A where at that point, the failure looks
like a hard failure. It could be detected by a self-test program with PDF
δ(t'), and the system would enter state AD, the Active Detected state, where
given that a spare exists, the system will purge the faulty unit and switch
in the spare. Or while in the active state, the fault could generate errors
with PDF ρ(t'). The system then will enter the AE, Active Error state. The
intermittent failure could manifest its intermittent state again so the
system would then enter state BE, the Benign Error state. Although the
failure is benign, the error may not be benign and may cause system failure
which is denoted by the BE to F transition [(1-C)ε(τ)]. Here ε(τ) is the
error detection density, and 1-C is the proportion of errors from which the
system is unable to recover. While in state BE, the error could be detected
and corrected. In this event, the system enters state BD (Benign Detected)
by transition Cε(τ). At this point, the system may choose to do nothing
further with the detected error and so move to the Benign state, or the
system may choose to reconfigure out the module containing the error and,
therefore, move to state 1. The other transition out of state AE is to
state F, the single-point failure transition [(1-C)ε(τ)]. This transition
is similar to the BE to F transition. In a well-designed fault-tolerant
system, (1-C)ε(τ) should be near zero in magnitude. If λ(t) were the PDF
for the arrival of a transient, α would be set to a value greater than zero
and β would be equal to zero. The PDF λ(t) for the arrival of a permanent
failure would be defined so that α=β=0. The dashed arc going from state AD
to A enables the analyst to include the effects of the system decision that
the detected fault which took the system from state A to AD was, in fact, a
transient. In this regard, the system would not reconfigure out a
non-failed module. A judicious choice of values for the single-fault model
affords the analyst a wide range of models. A different fault model may be
assigned to each stage or several models may be assigned to a given stage to
cover the effects of different failure mechanisms such as transients,
intermittents, hard failures, etc.
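
The interplay between the Active and Benign states can be made concrete with
a reduced version of the model. The sketch below is a simplification and not
the CARE III single-fault model: it keeps only states A and B, and it assumes
constant rates throughout, namely α (A to B), β (B to A), a detection rate δ
(A to AD), and an error-generation rate ρ (A to AE), whereas the model in the
text allows general densities. Under these assumptions the chance of leaving
through detection is δ/(δ+ρ), and the mean time until the fault is either
detected or produces an error is (1 + α/β)/(δ+ρ).

```python
def exit_probabilities(delta, rho):
    """Probabilities of leaving the latent states via detection or via error."""
    total = delta + rho
    return delta / total, rho / total


def mean_time_to_exit(alpha, beta, delta, rho):
    """Mean time from fault arrival until detection or error generation.

    Covers the permanent case (alpha == 0) and the intermittent case
    (alpha > 0, beta > 0). A transient (alpha > 0, beta == 0) can vanish
    for good, so this mean is not defined for it in the reduced model.
    """
    if alpha == 0.0:
        benign_factor = 1.0
    elif beta > 0.0:
        benign_factor = 1.0 + alpha / beta
    else:
        raise ValueError("transient case: the fault may never leave state B")
    return benign_factor / (delta + rho)


if __name__ == "__main__":
    # Hypothetical rates per hour, chosen only for illustration.
    print(mean_time_to_exit(0.0, 0.0, 360.0, 40.0))    # permanent fault
    print(mean_time_to_exit(50.0, 10.0, 360.0, 40.0))  # intermittent fault
```

In this reduced form the time spent in the Benign state does not change
which exit is taken, but it stretches the latency by the factor 1 + α/β,
which is one reason intermittent faults are harder to purge than permanent
ones.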


The reader will note that the reliability model in figure 10 has three
measures of time associated with it which necessarily makes the model a
semi-Markov process. This added complexity is required because the behavior
of the system is dependent on the onset of the various fault-behavior
events.


LARGE REDUCTION OF SYSTEM STATE SIZE 


The capability thus described would come at no small computational price
if state-of-the-art techniques were employed. In fact, if one were to
utilize the popular "Markov" modeling technique on a nontrivial system such
as the flight control system shown in figure 11 (which is composed of 22
stages and 64 reconfigurable modules) coupled with a reasonable set of
failure and error models (some of which are delineated in figure 12), the
number of system states would be on the order of millions. For each state,
a linear differential equation is formed. Clearly the solution of millions
of differential equations is computationally intractable if not impossible
with today's technology. But CARE III was designed to assess these types of
systems. How does it do it?







To understand the CARE III state reduction method, it is expedient to
first examine how the state size build-up occurs in the Markov method. A
Markov state is described as an ordered n-tuple. The components of the
n-tuple contain information about the number of failed reconfigurable units
in the system plus system fault-handling information for each module and
fault type (hard failure, transient, etc.). For the system shown in figure
11, the n-tuple has a minimum of 22 components, i.e., one for each stage.
For each stage, additional n-tuple fault-handling components are added to
describe the number of failed units that are system detected, the number
that are identified with a reconfigurable module, and the number that have
been recovered. A set of fault-handling components is included in the
n-tuple for each type of failure, e.g., transient, hard, intermittent, etc.
The total number of n-tuple components becomes very large. The product of
the numbers of values that each n-tuple component can take gives the number
of possible system states. In contemporary practice, tractable analyses are
accomplished by making numerous assumptions about the system to reduce the
state size to the order of 1000. CARE III, on the other hand, retains a
considerable amount of detail without the burden of unmanageable state
sizes. This feat is accomplished in CARE III by separating fault-handling
information from the structure model, i.e., information about the number of
failed units. Each model is worked separately to a point and then
recombined (ref. 21). An example of this state reduction is depicted by
figures 10 and 13. When CARE III processes the fault-handling model of
figure 10, that information is mapped into time-varying transition rates,
λ1(t), λ2(t), as shown in figure 13. What might have been a stationary
semi-Markov process for the system of figure 10 will always become a
nonstationary Markov process. For large systems, state size reductions of
at least 10,000 to 1 have been estimated. The solution to the nonstationary
process model of figure 13 is given by the solution to the forward
Kolmogorov equation depicted in figure 14. The system reliability is
computed by summing the probabilities, P_l(t), for the allowable or success
states. Numerically, it is more accurate to compute the probability of
system failure in lieu of reliability (probability of system survival). The
user-defined fault trees specify the system failure states, so that the
probability of system failure is simply the sum of P_l(t) over the set of
system failure states. CARE III actually computes the probability of system
failure using the equation



[equation not legible in this reproduction]

where the probability of system failure is given by the sum of Q_l(t) over
the set of system failure states.
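
A nonstationary Markov model of this kind can be solved numerically by
integrating the forward Kolmogorov equations. The sketch below is not the
CARE III algorithm and does not reproduce figures 13 or 14, which are not
available in the text: it assumes a hypothetical chain 0 -> 1 -> ... -> n in
which the only transitions are failures with time-varying rates, and it uses
a classical fourth-order Runge-Kutta step. The function name and the example
rates are illustrative.

```python
from math import exp


def integrate_chain(rates, t_end, steps=1000):
    """Integrate dP/dt for a chain 0 -> 1 -> ... -> n with rates[k](t).

    rates[k] is a function of time giving the transition rate from state k
    to state k + 1. Returns [P_0, ..., P_n] at t_end, starting in state 0.
    """
    n = len(rates)

    def deriv(t, p):
        flow = [rates[k](t) * p[k] for k in range(n)]
        d = [0.0] * (n + 1)
        for k in range(n):
            d[k] -= flow[k]      # probability leaving state k
            d[k + 1] += flow[k]  # probability entering state k + 1
        return d

    p = [1.0] + [0.0] * n
    h = t_end / steps
    t = 0.0
    for _ in range(steps):
        k1 = deriv(t, p)
        k2 = deriv(t + h / 2, [x + h / 2 * k for x, k in zip(p, k1)])
        k3 = deriv(t + h / 2, [x + h / 2 * k for x, k in zip(p, k2)])
        k4 = deriv(t + h, [x + h * k for x, k in zip(p, k3)])
        p = [x + h / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
        t += h
    return p


if __name__ == "__main__":
    # Hypothetical time-varying rates (per hour) for a 2-unit system.
    lam1 = lambda t: 2.0e-4 * (1.0 - 0.5 * exp(-t))
    lam2 = lambda t: 1.0e-4 * (1.0 + 0.1 * t)
    p = integrate_chain([lam1, lam2], 10.0)
    print("probability of failure state:", p[-1])
```

Printing the failure-state probability directly, rather than one minus the
sum of the success-state probabilities, follows the numerical advice given
above, since subtracting a number very close to one loses precision.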

CARE III has entered the first stage of validation by undergoing 
extensive testing at the computer program (debugged) and math model levels. 
It is also being applied to several experimental ultrareliable design 
concepts to evaluate CARE III modeling flexibility and the user-oriented 
fault tree interface. 




GLOSS - GATE LOGIC SOFTWARE SIMULATION


It is one thing to implement a very powerful reliability model and
quite another to make it useful. For all reliability evaluators, including
CARE III, a weakness lies in the unavailability of data for many of the
fault-handling parameters. The situation is not a total loss, however,
since reasonable engineering estimates can be made in many cases, and
furthermore, the sensitivity of the system reliability can be tested against
variations in the marginal data. A better way, of course, is to measure or
estimate the parameters based on some empirical observations.


LATENT-FAULT MODELING AND MEASUREMENT METHODOLOGY 


Since system fault detection appears to be the most critical
fault-handling parameter, NASA-Langley in 1977 initiated a series of studies
to investigate a methodology for measuring the fault latency of digital
computers (ref. 22). The methodology consisted of simulating a 1000
equivalent gate processor in a host CDC Cyber 173 computer. The simulated
processor was a paper design and is referred to as a "hypothetical"
machine. The hypothetical machine was simulated at the gate level.
Actually, two copies of the hypothetical machine executed identical code in
synchronism, where one machine received a stuck-at fault at the onset of the
computation. Detection or nondetection was determined after the nonfaulted
processor completed its execution. At that time, the computational results
of the two simulated machines were compared, bit for bit. Any difference
constituted detection. If no detection occurred, the code's input variables
were randomly altered, and the process was repeated for the same fault.
This scheme was repeated for up to eight executions for the same fault, if
detection didn't occur. If a detection occurred in less than or equal to
eight repetitions, or no detection occurred after eight repetitions, then a
new trial began where another stuck-at fault was induced. This overall
process was repeated for up to 1000 randomly selected faults. The 1000
induced faults were selected as a function of piece-part failure rates and
were distributed equally across the nodes of the gates. The latency time,
i.e., time to fault detection, is expressed in number of code executions or
repetitions. The time scale can easily be mapped into CPU seconds of code
execution, if desired.
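
The trial procedure can be sketched on a toy circuit. The code below is not
GLOSS and does not model the hypothetical processor: it assumes a 4-bit
ripple-carry adder built from AND, OR, and XOR gates as the "machine", random
operands as the "code inputs", and faults chosen uniformly over nodes rather
than weighted by piece-part failure rates. All names are hypothetical. It
keeps the essential steps: inject one stuck-at fault, run faulted and
nonfaulted copies on the same inputs, compare outputs bit for bit, and repeat
with new inputs for up to eight executions.

```python
import random

OPS = {"AND": lambda x, y: x & y, "OR": lambda x, y: x | y, "XOR": lambda x, y: x ^ y}


def build_adder(bits=4):
    """Return (gates, inputs, outputs) for a ripple-carry adder netlist."""
    gates, outputs, carry = [], [], None
    inputs = [f"a{i}" for i in range(bits)] + [f"b{i}" for i in range(bits)]
    for i in range(bits):
        gates.append((f"x{i}", "XOR", f"a{i}", f"b{i}"))
        gates.append((f"g{i}", "AND", f"a{i}", f"b{i}"))
        if carry is None:                      # bit 0 is a half adder
            outputs.append(f"x{i}")
            carry = f"g{i}"
        else:
            gates.append((f"s{i}", "XOR", f"x{i}", carry))
            gates.append((f"p{i}", "AND", f"x{i}", carry))
            gates.append((f"c{i}", "OR", f"g{i}", f"p{i}"))
            outputs.append(f"s{i}")
            carry = f"c{i}"
    outputs.append(carry)
    return gates, inputs, outputs


def simulate(gates, outputs, stimulus, fault=None):
    """Evaluate the netlist; fault is (node, stuck_value) or None."""
    values = dict(stimulus)
    if fault and fault[0] in values:
        values[fault[0]] = fault[1]            # stuck-at on a primary input
    for out, op, in1, in2 in gates:
        value = OPS[op](values[in1], values[in2])
        values[out] = fault[1] if fault and fault[0] == out else value
    return tuple(values[o] for o in outputs)


def measure_latency(n_faults=1000, max_reps=8, seed=0):
    """Histogram of detections by repetition; last bin counts undetected."""
    rng = random.Random(seed)
    gates, inputs, outputs = build_adder()
    nodes = inputs + [g[0] for g in gates]
    histogram = [0] * (max_reps + 1)
    for _ in range(n_faults):
        fault = (rng.choice(nodes), rng.randint(0, 1))
        for rep in range(max_reps):
            stim = {name: rng.randint(0, 1) for name in inputs}
            if simulate(gates, outputs, stim) != simulate(gates, outputs, stim, fault):
                histogram[rep] += 1            # detected on repetition rep + 1
                break
        else:
            histogram[max_reps] += 1           # still latent after max_reps
    return histogram


if __name__ == "__main__":
    print(measure_latency())
```

Every node of this small adder influences an output for many input patterns,
so the toy should not be expected to show the detection percentages reported
for the simulated processors; it only illustrates the bookkeeping of the
method.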

The comparison of output data from two or more computers is often
referred to as a comparison-monitoring detector, which is an important
detection mechanism employed in many operational fault-tolerant systems. In
the CARE III fault-handling model shown in figure 10, comparison-monitoring
detection is modeled by ε(t).

The results of the pilot study were both surprising and intriguing.
Using six different programs ranging from a very simple fetch-and-store
program to a very complex linear convergence scheme, the pilot study showed
that only 50 percent of the induced faults were detected after eight
repetitions for all six programs. Figure 15 depicts typical results. The
implication that these results have on reliability assessment for highly
reliable systems is staggering. It suggests that highly reliable
fault-tolerant systems cannot be designed with comparison monitoring or
majority voting as the major stuck-at fault detector (ref. 7).


VERIFICATION OF LATENT-FAULT MEASUREMENT METHODOLOGY 


It was with this concern that a series of further experiments to
investigate the validity of the pilot study results were designed at
NASA-Langley. After all, it was not clear that similar results could be
obtained for a real processor executing practical software. The goals of
the follow-on work were to test the findings of the pilot study utilizing a
real avionic miniprocessor, to assess the significance of injecting faults
at the gate level and at the functional pin level, to evaluate an airborne
self-test program, and to account for undetected faults (refs. 22, 23, 24,
and 25). The methodology for gate level simulation, which was codeveloped
by NASA-Langley and Bendix, is called the GLOSS, Gate Logic Software
Simulator.

The pilot study results were tested in three phases using a gate
simulation of the Bendix BDX-930 miniprocessor, a 5000 gate equivalent CPU.
Initially, the same six pilot study programs were coded using the comparable
primitive instruction set of the hypothetical machine, i.e., load, store,
add, subtract, and branch. The next phase allowed the six programs to be
recoded using the rich instruction set of the BDX-930, and finally
comparison-monitoring detection was measured for flight control system code
in lieu of the six pilot study programs. The surprising outcome of this
experiment is typified in figure 16 for all six programs. The percent of
nondetected faults is about the same for all the programs, instruction sets,
and two different machines, i.e., 50 percent. As the code becomes more
complex, the shape of the histogram bunches up so that virtually all the
detection occurs in the first execution. The latency time decreases
somewhat with increased code complexity, but not the percent detected.

When the same set of experiments is repeated with the exception that
faults are induced at the register transfer or pin level in lieu of the gate
level, similar results, shown in figure 17, appear. One notable difference,
however, is that the level of detection rises significantly.

As an extension to the pilot study, the latent-fault measurement
methodology was applied to an airborne self-test program consisting of 2000
BDX-930 instructions which executed in three milliseconds on the BDX-930.
While the simulator executed the self-test program, faults were induced at
the gate level and, in a separate experiment, at the pin level. The design
goal for the self-test program was 95 percent detection. Figure 18 presents
a summary of the self-test detection values and the comparison-monitoring
detection values. For the same level of fault inducement, the self-test
code shows the highest detection, but fell short of the 95 percent
requirement for gate-level faults. With considerable effort and expense,
the 87 percent self-test detection was increased to 94 percent, which appears
to be a practical upper bound on gate-level fault detection. Flight control
system code improved fault detection substantially but still fell short of 0
percent undetected for gate-level faults. For component-level injected
faults, the industry-assumed value of 0 percent undetected was achieved.



SOME PROFOUND RESULTS AND OBSERVATIONS 


The wide dispersion of detection raises some confounding questions
about the method of fault injection and, hence, which detection parameters
to use in reliability assessments. The inducement of faults at the gate or
pin levels yields a wide dispersion of detection when all other factors are
equal. This concern is further exacerbated by the knowledge that the method
practiced by industry, pin-level fault injection, yields the higher
detection values. At our present level of understanding of fault
propagation mechanisms, the pin-level detection values would appear to be
nonconservative and should be used with great caution, if at all. This
recommendation is based on our knowledge that the gate-level faults that
were not detected after eight repetitions are potentially detectable or
distinguishable, i.e., there exists some code or sequence of code execution
that will propagate a distinguishable fault.

In the process of investigating the reason why faults were not detected
after eight repetitions, it was discovered that there exists a class of
faults that can never have an effect on the system and, therefore, can
never be detected. This class of indistinguishable faults has been
estimated to comprise 16 percent of all faults. An example of an
indistinguishable fault is a stuck-at fault located at the unused output of
a flip-flop circuit. An important outcome of this discovery regards the
method of estimating detection coverage. The conservative approach, and the
correct one, is to delete the 20 percent indistinguishable faults from the
set of induced faults in the computation of detection coverage. The net
effect is to reduce the magnitude of detection coverage.
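
The arithmetic behind this choice can be sketched as follows. This is one
reading of the paragraph above, offered as an assumption and not as the
computation used in the study: deleting the indistinguishable faults from
the induced set is compared with the alternative of counting them as
covered because they are harmless. The function name, the dictionary keys,
and the fault counts are all hypothetical.

```python
from typing import Dict


def coverage_estimates(
    total: int, detected: int, indistinguishable: int
) -> Dict[str, float]:
    """Compare three ways of computing detection coverage.

    total             -- number of induced faults
    detected          -- number of induced faults that were detected
    indistinguishable -- number of induced faults that can never affect
                         the system (and so can never be detected)
    """
    if total <= 0 or detected < 0 or indistinguishable < 0:
        raise ValueError("counts must be non-negative and total positive")
    if detected + indistinguishable > total or indistinguishable >= total:
        raise ValueError("inconsistent fault counts")
    return {
        # No adjustment at all.
        "raw": detected / total,
        # Indistinguishable faults deleted from the induced set.
        "indistinguishable_deleted": detected / (total - indistinguishable),
        # Indistinguishable faults treated as if they were covered.
        "indistinguishable_counted_as_covered": (detected + indistinguishable) / total,
    }


# Invented counts, chosen only to make the arithmetic easy to follow.
estimates = coverage_estimates(total=1000, detected=500, indistinguishable=200)
for name, value in estimates.items():
    print(name, value)
```

Under these assumed counts, deleting the indistinguishable faults yields a
lower coverage than counting them as covered, which is the sense in which
the deletion approach is the more conservative of the two.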

The lessons learned from these latent-fault modeling and measurement
studies are summarized as follows:

o Practical measurement of detection coverage for stuck-at faults is
possible and is a necessary aspect of reliability assessment.

o Comparison-monitoring detection for typical application code is much
less than expected, which poses serious implications for highly
reliable systems.

o 95 percent gate-level self-test detection coverage is measurable and
achievable but expensive to accomplish.

o The industry practice of measuring self-test detection by inducing
faults at the pin level may not be conservative, and in view of the
fact that the reliability of highly reliable systems is very
sensitive to detection, further analysis of this practice is
required.

CONCLUDING REMARKS 


The CARE III and the GLOSS are presently in the developmental stages,
with CARE III clearly in the lead. The CARE III math model is embodied in a
Fortran IV computer program that has been receiving considerable national
scrutiny. The validation of CARE III is being conducted by industry, the
university community, and by the U.S. Government at NASA's Langley Research
Center and by the U.S. Air Force at Wright-Patterson Air Force Base. To
date, only minor correctable problems have cropped up; and, if this trend
continues, CARE III will be released within a year.

The development of a generally applicable GLOSS computer program, which
embodies the GLOSS methodology, is currently underway. Initially, the GLOSS
will execute on the VAX-11/700 series computers but will be written for
maximum computer portability.






REFERENCES 



1. Mathur, F. P.: Reliability Study of Fault-Tolerant Computers in
    Supporting Research and Advanced Development. Jet Propulsion
    Laboratory, Aug. 1969, Space Programs Summary 37-58, Vol. III,
    pp. 106-113.

2. Blazek, R. H., et al.: Demonstration of Combined Reliability
    Prediction and Verification Techniques to a Typical Flight Control
    System, Vol. I, Development and Application of Tabular System
    Reliability Analysis to the F-111 Pitch Flight Control System.
    Battelle Columbus Laboratories, AFFDL-TR-71-128, Vol. I, (Available
    from AF Flight Dynamics Lab., Wright-Patterson AFB), Oct. 1971.

3. Roth, J. P., et al.: Phase II of an Architectural Study for a
    Self-Repairing Computer. SAMSO TR-67-106, U.S. Air Force, Nov. 1967.
    (Available from DDC as AD 825460).

4. Bouricius, W. G., et al.: Reliability Modeling Techniques and
    Trade-Off Studies for Self-Repairing Computers. RC 2378, Res. Div.,
    IBM Corp., Feb. 1969.

5. Raytheon Company, Sudbury, MA: Reliability Model Derivation of a
    Fault-Tolerant, Dual, Spare-Switching Digital Computer System. NASA
    CR-132441, 1974.

6. Raytheon Company, Sudbury, MA: An Engineering Treatise on the CARE II
    Dual Mode and Coverage Models. NASA CR-144993, 1976.

7. Bavuso, S. J.: Impact of Coverage on the Reliability of a Fault
    Tolerant Computer. NASA TN D-7938, 1975.

8. Ultrasystems, Inc., Newport Beach, CA: Reconfigurable Computer
    Systems Study. NASA CR-132537, 1974.

9. Bjurman, B. E., et al.: Airborne Advanced Reconfigurable Computer
    System (ARCS). The Boeing Commercial Airplane Company. NASA
    CR-145024, 1976.

10. O'Neill, E. J.; and Halverson, J. R.: Study of Intermittent Field
    Hardware Failure Data in Digital Electronics. Sperry Univac Defense
    Systems, St. Paul, MN. NASA CR-159268, 1980.

11. Masson, G. M.: Executive Summary - Intermittent/Transient Faults in
    Computer Systems. The Johns Hopkins University. NASA CR-159229, 1979.

12. Nagel, P. M.: Software Reliability: Repetitive Run Experimentation
    and Modeling. Boeing Computer Services Company. NASA CR-165836, 1982.

13. Stiffler, J. J., et al.: CARE III Final Report Phase 1, Vols. 1 and 2.
    Raytheon Co., Sudbury, MA. NASA CR-159122 and NASA CR-159123, 1979.




14. Bavuso, S. J.: Trends in Reliability Modeling Technology for Fault
    Tolerant Systems. AGARD Conf. Proc. No. 261 on Avionics Reliability,
    Its Techniques and Related Disciplines, April 1979.

15. Stiffler, J. J.; and Bryant, L. A.: CARE III Phase II Report,
    Mathematical Description. Raytheon Co., Sudbury, MA. NASA CR-3566,
    1982.

16. Hopkins, A. L.; and Smith, T. B.: The Architectural Elements of a
    Symmetric Fault-Tolerant Multiprocessor. IEEE Trans. on Computers,
    Vol. C-24, No. 5, 1975.

17. Osder, S.: The DC-9-80 Digital Flight Guidance System's Monitoring
    Techniques. Sperry Flight Systems, Phoenix, AZ, AIAA Paper 79-1704,
    1979.

18. O'Hern, E. A.: Space Shuttle Avionics Redundancy Management. Rockwell
    International, AIAA Digital Avionics Systems Conference, April 1975.

19. Wensley, J. H., et al.: Design Study of Software-Implemented Fault
    Tolerance (SIFT) Computer. SRI International, Menlo Park, CA. NASA
    CR-3011, 1978.

20. Stiffler, J. J.: Fault Coverage and the Point of Diminishing Returns.
    Journal of Design Automation and Fault Tolerant Computing, Vol. 2,
    No. 4, Oct. 1978.

21. Trivedi, K. S.; and Geist, R. M.: A Tutorial on the CARE III Approach
    to Reliability Modeling. Duke University, Durham, NC. NASA CR-3488,
    1981.

22. Nagel, P. M.: Modeling of a Latent Fault Detector in a Digital System.
    Vought Corp., Hampton, VA. NASA CR-145371, 1978.

23. McGough, J. G.; and Swern, F. L.: Measurement of Fault Latency in a
    Digital Avionic Miniprocessor. Bendix Corp., Teterboro, NJ. NASA
    CR-3462, 1981.

24. McGough, J. G., et al.: Methodology for Measurement of Fault Latency
    in a Digital Avionic Miniprocessor. AGARD Conf. Proc. No. 303 on
    Tactical Airborne Distributed Computing and Networks, June 1981.

25. Bavuso, S. J., et al.: Latent Fault Modeling and Measurement
    Methodology for Application to Digital Flight Controls. Advanced Flight
    Control Symposium, USAF Academy, Colorado Springs, CO, Aug. 1981.

Figure 2. - CARE II coverage model 


Figure 3. - CARE III salient features






















COMPUTER AIDED RELIABILITY ESTIMATION (CARE III)

Figure 5. - CARE III user process











Figure 7. - CARE III input for ultrareliable fault-tolerant multiprocessor.













Figure 9. - CARE III fault tree input for PAS function.








Figure 11. - Advanced reconfigurable flight control system.




















$P_A(t)$ = probability of being in state A at time t

$\lambda_{jA}(t)$ = transfer rate from state j to state A

$\lambda_A(t) = \sum_j \lambda_{Aj}(t)$

[term garbled in source: a coverage-related quantity for faults that cause
a transfer]

The system reliability is given by

$R(t) = \sum_{A \in L} P_A(t)$

for the set L of allowable states.


Figure 14. - CARE III state probability computation. 



latency data 



Figure 15. - Latent-fault measurement.







COMPARISON-MONITORING



Figure 17. - Fault latency distribution - component-level faults 



Figure 18. - Summary and results.


